Abstract:
Architectures for supercomputing have evolved rapidly over the past ten years along essentially two, but possibly converging, tracks: first, Symmetric Multiprocessing (SMP) cluster architectures (e.g. the IBM p-series and the Cray XMT series), and second, fully distributed node architectures exemplified by the IBM Blue Gene series. Both have advantages and disadvantages. Problems in science and engineering have expanded in a fashion exemplified by Parkinson's Law ("work expands to fill the time available"). Science problems, especially those in the physiological domain, are themselves defined over vast ranges of length scale, as will be shown in the following sections. To solve problems whose length scales vary substantially, there are two possible approaches: either discretise down to the smallest scale, with the possibility of producing data sets and systems of equations so large that the memory requirements exceed those of a specific machine, or divide the problem into sub-domains of appropriate length scale and map these discretised sub-domains onto appropriate machine architectures joined by a fast communication link. The definitions of "appropriate" and "fast" are at present determined on a case-by-case basis; a generic solution to where the "optimum" boundary between differing architectures should lie is a substantial problem in itself. Since each architecture has its own advantages and disadvantages, our group at Canterbury has deliberately exploited this by linking both compute architectures together to solve a single problem: the simulation of flow in the cerebro-vasculature. We show that certain mappings of large vascular trees have constraints placed upon them when more than 256 Blue Gene/L processors are used.
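As an illustration of the scale-splitting strategy described above, here is a minimal sketch that partitions a vascular tree into coarse and fine sub-domains by vessel radius. The radius threshold and segment names are invented for the example; they are not the authors' actual decomposition:

```python
from dataclasses import dataclass

@dataclass
class Vessel:
    name: str
    radius_mm: float  # characteristic length scale of the segment

def split_by_scale(vessels, threshold_mm=0.5):
    """Partition segments so each group can be mapped to a different
    architecture (e.g. coarse tree -> SMP cluster, fine tree -> Blue Gene).
    The 0.5 mm threshold is a made-up example, not a figure from the paper."""
    coarse = [v for v in vessels if v.radius_mm >= threshold_mm]
    fine = [v for v in vessels if v.radius_mm < threshold_mm]
    return coarse, fine

tree = [Vessel("carotid", 2.5), Vessel("MCA", 1.5), Vessel("arteriole", 0.05)]
coarse, fine = split_by_scale(tree)
print([v.name for v in coarse], [v.name for v in fine])
```

Where the threshold should sit is exactly the "optimum boundary" question the abstract flags as an open problem.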
Abstract:
Automotive electronics is a rapidly expanding area, with an increasing number of driver-assistance and infotainment devices becoming standard in new vehicles. A review of current in-vehicle networking standards reveals a fragmented and proprietary landscape in which several standards, such as MOST, CAN and LVDS, dominate, all of them currently used by various vehicle manufacturers. Given the cost of employing a range of networking standards, there is a general desire within the automotive industry to converge on IEEE 802.3 Ethernet for all in-vehicle communication between devices. The introduction of in-vehicle cameras for driver-assistance applications, and the associated high bandwidth requirements of multi-camera systems, has accelerated the demand for a unifying automotive network architecture. This paper presents an overview of current research in the literature and identifies future trends in the field.
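The bandwidth pressure from multi-camera systems mentioned above is easy to quantify. A back-of-envelope sketch follows; the camera count, resolution and frame rate are illustrative assumptions, not figures from the paper:

```python
# Rough uncompressed bandwidth for an illustrative 4-camera ADAS setup.
cameras = 4
width, height = 1280, 800      # pixels per frame (assumed)
bits_per_pixel = 24            # raw RGB, no compression (assumed)
fps = 30

per_camera_mbps = width * height * bits_per_pixel * fps / 1e6
total_mbps = cameras * per_camera_mbps
print(f"{per_camera_mbps:.0f} Mbit/s per camera, {total_mbps:.0f} Mbit/s total")
# ~737 Mbit/s per camera: far beyond classic CAN (<= 1 Mbit/s) and beyond
# 100 Mbit/s automotive Ethernet, motivating gigabit links or compression.
```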
Abstract:
The current hybrid architectures used to accelerate the nodes of distributed computing systems running Big Data applications are mainly based on Nvidia's GPU or Intel's MIC accelerators. These accelerators are limited by their overly general and ad hoc structural and architectural features. In this paper, we propose a Map-Scan architecture, a generalization of the Map-Reduce architecture, which is better suited to a parallel approach to defining the accelerator part of a hybrid system. The paper describes the organization and architecture of a hybrid system based on our Map-Scan Accelerator (MSA). The degree of parallelism achieved by our proposal is compared with current implementations, and the energy consumption of the ASIC versions of MSA is estimated by simulation. We conclude that the Map-Scan approach to defining the accelerator of a hybrid system provides an appropriate solution for accelerating a range of Big Data and linear-algebra-based applications.
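To make the map-scan/map-reduce distinction concrete, here is a functional sketch (software only, not the MSA hardware): reduce collapses a mapped sequence to a single value, whereas scan produces every running prefix, which generalizes reduce because the last scan element is the reduction itself.

```python
from functools import reduce
from itertools import accumulate
from operator import add

data = [3, 1, 4, 1, 5]
mapped = [x * x for x in data]            # map step

total = reduce(add, mapped)               # map-reduce: one value
prefixes = list(accumulate(mapped, add))  # map-scan: all running sums

print(total)     # 52
print(prefixes)  # [9, 10, 26, 27, 52] -- last element equals the reduction
```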
Abstract:
Deep Neural Networks (DNNs) have been driving the mainstream of machine learning applications. However, deploying DNNs on modern hardware under stringent latency and energy constraints is challenging because of the compute-intensive and memory-intensive execution patterns of various DNN models. We propose an algorithm-architecture co-design to boost DNN execution efficiency. Leveraging the noise resilience of the nonlinear activation functions in DNNs, we propose dual-module processing, which uses approximate modules learned from the original DNN layers to compute insensitive activations, saving the expensive computation and data accesses that exact evaluation of those activations would otherwise require. We then design an Executor-Speculator dual-module architecture that supports balanced execution and reduced memory access. With acceptable degradation in model inference quality, our accelerator design achieves a 2.24x speedup and a 1.97x energy-efficiency improvement for compute-bound Convolutional Neural Networks (CNNs) and memory-bound Recurrent Neural Networks (RNNs).
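A minimal NumPy sketch of the dual-module idea as we read it; the quantization scheme, sensitivity rule and threshold are our illustrative assumptions, not the paper's mechanism. A cheap low-precision module predicts pre-activations, and only those close to the ReLU decision boundary, where approximation error could flip the output, are recomputed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
x = rng.normal(size=64)
W_approx = np.round(W * 8) / 8     # cheap module: 3 fractional bits (assumed)

z_approx = W_approx @ x            # speculator: approximate pre-activations
sensitive = np.abs(z_approx) < 0.5 # near the ReLU boundary (threshold assumed)
z = z_approx.copy()
z[sensitive] = W[sensitive] @ x    # executor: exact compute only where needed
y = np.maximum(z, 0)               # ReLU over mixed exact/approximate values

print(f"exact rows recomputed: {sensitive.sum()}/64")
```

The saving comes from the insensitive majority, where the approximate result is kept and the exact weights are never fetched.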
Abstract:
Although an agile approach is standard for software design, how to properly adapt this method to hardware is still an open question. This work addresses that question while building a system on chip (SoC) with specialized accelerators. Rather than using a traditional waterfall design flow, which starts by studying the application to be accelerated, we begin by constructing a complete flow from an application expressed in a high-level domain-specific language (DSL), in our case Halide, to a generic coarse-grained reconfigurable array (CGRA). As our understanding of the application grows, the CGRA design evolves, and we have developed a suite of tools that tune the application code, the compiler, and the CGRA to increase the efficiency of the resulting implementation. To meet our continued need to update parts of the system while maintaining the end-to-end flow, we have created DSL-based hardware generators that not only produce the Verilog needed to implement the CGRA, but also create the collateral that the compiler/mapper/place-and-route system needs to configure its operation. This work provides a systematic approach for designing and evolving high-performance, energy-efficient hardware-software systems for any application domain.
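As an illustration of the "generator emits both the hardware and the compiler collateral" idea, here is a toy sketch; the PE description, module name, and JSON format are invented, not the authors' Halide-to-CGRA tools. A single Python description of a processing element emits both a Verilog stub and a JSON encoding table the mapper could consume, so the two can never drift apart:

```python
import json

PE = {"name": "alu_pe", "ops": ["add", "sub", "mul"], "width": 16}

def emit_verilog(pe):
    # One case arm per supported op, in declaration order.
    cases = "\n".join(f"      {i}: out = a {sym} b;"
                      for i, sym in enumerate(["+", "-", "*"]))
    return f"""module {pe['name']} (
  input  [{pe['width'] - 1}:0] a, b,
  input  [1:0] opcode,
  output reg [{pe['width'] - 1}:0] out);
  always @(*) begin
    case (opcode)
{cases}
      default: out = 0;
    endcase
  end
endmodule
"""

def emit_collateral(pe):
    # What the mapper/place-and-route needs: the op-to-opcode encoding.
    return json.dumps({"pe": pe["name"],
                       "encoding": {op: i for i, op in enumerate(pe["ops"])}},
                      indent=2)

print(emit_verilog(PE))
print(emit_collateral(PE))
```

Regenerating both artifacts from one source is what keeps an agile, frequently changing design consistent end to end.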
Abstract:
In modern System-on-Chip (SoC) design, the performance of processing units is in constant demand. With hardware accelerators in a computer system, the core processor can offload tasks to them, creating parallel execution that improves processing speed. One of the crucial functional blocks in a hardware accelerator's design is the storage unit, which holds both the data needed for processing and the processed results. In conventional hardware accelerator designs, the storage architecture is shaped to suit a particular processing algorithm, which reduces flexibility in SoC design. In this paper, we implement a novel storage architecture that can handle multiple accelerator engines and is modular with respect to the typical specifications of hardware accelerators.
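A behavioural sketch of the shared-storage idea as we read it; the banking scheme and round-robin arbitration are illustrative assumptions, not the paper's design. A banked storage unit is shared by several accelerator engines, with an arbiter granting at most one engine per bank per cycle:

```python
class SharedStorage:
    """Banked storage serving multiple accelerator engines.
    Round-robin arbitration per bank is an assumed policy."""
    def __init__(self, banks=4, words_per_bank=256):
        self.mem = [[0] * words_per_bank for _ in range(banks)]
        self.rr = 0  # round-robin pointer

    def cycle(self, requests):
        """requests: list of (engine_id, bank, addr, write, data).
        Grants at most one request per bank; returns granted reads."""
        granted, busy = {}, set()
        for off in range(len(requests)):
            eng, bank, addr, write, data = requests[(self.rr + off) % len(requests)]
            if bank in busy:
                continue  # bank conflict: that engine retries next cycle
            busy.add(bank)
            if write:
                self.mem[bank][addr] = data
            else:
                granted[eng] = self.mem[bank][addr]
        self.rr = (self.rr + 1) % max(len(requests), 1)
        return granted

store = SharedStorage()
store.cycle([(0, 1, 10, True, 42)])            # engine 0 writes
print(store.cycle([(1, 1, 10, False, None)]))  # engine 1 reads -> {1: 42}
```

Because no engine owns the banks, the same storage block can back accelerators with different access patterns, which is the flexibility the abstract claims over algorithm-specific storage.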
Abstract:
We present a scientific computing accelerator on FPGA that uses hundreds of processors working in parallel, organized into several SIMD cores. The accelerator is installed within an Ethernet network and acts as a high-performance computing server. A prototype for processing solar images achieves performance that competes with a cluster.
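To illustrate the "accelerator as a network compute server" usage model, here is a hypothetical client; the hostname, port, and length-prefixed wire format are invented for the example, not taken from the paper. A host sends raw image bytes over TCP and reads back the processed result:

```python
import socket
import struct

def _recv_exact(sock, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed connection early")
        buf += chunk
    return buf

def offload_image(pixels: bytes, host="fpga-accel.local", port=5000) -> bytes:
    """Send raw image bytes to the FPGA server; return the processed bytes.
    Host, port, and framing are assumptions for the example."""
    with socket.create_connection((host, port)) as s:
        s.sendall(struct.pack("!I", len(pixels)) + pixels)
        (n,) = struct.unpack("!I", _recv_exact(s, 4))
        return _recv_exact(s, n)
```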